ABSTRACT

We  describe  the  proposed  NYU  Ultracomputer,  a  shared  memory  MIMD 
parallel  machine  composed  of  thousands  of  autonomous  processing  elements.
This  machine  uses  an  enhanced  message  switching  network  having  the  topology
of  an  Ω-network  to  approximate  the  ideal  of  conflict-free  access  to  a  common
memory  and  to  implement  efficiently  a  new  fetch-and-add  synchronization
primitive.  We  outline  the  hardware  required  to  construct  such  a  system,
consisting  of  4096  processors,  using  1990  technology
and  refer  to  other  work  indicating  how  the  goal  of  a  distributed  operating 
system  free  from  serial  bottlenecks  can  be  achieved  by  employing  the 
fetch-and-add  primitive.  Finally,  we  compare  the  Ultracomputer  project 
with  other  research  in  parallel  processing. 


1  This  work  was  supported  by  DOE  grant  DE-AC02-76ER03077  and  by  NSF  grant
NSF-MCS79-21258.
shared  variable  and  many  fetch-and-add  operations  simultaneously  address  V,  the
effect  of  these  operations  is  exactly  what  it  would  be  if  they  occurred  in  some 
(unspecified)  serial  order,  i.e.  V  is  modified  by  the  appropriate  total  increment 
and  each  operation  yields  the  intermediate  value  of  V  corresponding  to  its  position
in  this  order.  The  following  example  illustrates  the  semantics  of  fetch-and-add:
Assume  V  is  a  shared  variable.  If  PEi  executes

     ANSi  <--  F&A(V,ei)

and  if  PEj  simultaneously  executes

     ANSj  <--  F&A(V,ej)

and  if  V  is  not  simultaneously  updated  by  yet  another  processor,  then  either

     ANSi  <--  V                    ANSi  <--  V+ej
                          or
     ANSj  <--  V+ei                 ANSj  <--  V

and,  in  either  case,  the  value  of  V  becomes  V+ei+ej.
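
The two-PE case can be checked mechanically. The following Python sketch models the shared variable as a one-element list and enumerates every serial order of the requests. The names `fetch_and_add` and `outcomes` are illustrative only and do not come from the paper, and the indivisibility of each operation is simply assumed by the model.

```python
from itertools import permutations

def fetch_and_add(cell, e):
    """Return the old value of cell[0] and add e to it (modeled as one indivisible step)."""
    old = cell[0]
    cell[0] = old + e
    return old

def outcomes(v, increments):
    """Map each serial order (a tuple of PE numbers) to (answers by PE, final V)."""
    results = {}
    for order in permutations(range(len(increments))):
        cell = [v]
        answers = [None] * len(increments)
        for pe in order:
            answers[pe] = fetch_and_add(cell, increments[pe])
        results[order] = (tuple(answers), cell[0])
    return results

# Hypothetical values: V = 10, ei = 3, ej = 5.
for order, (answers, final) in outcomes(10, [3, 5]).items():
    print(order, answers, final)
```

With these values the order "PEi first" gives answers (10, 13) and the order "PEj first" gives (15, 10); V ends at 18 either way, matching the two cases listed above.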

For  another  example  consider  several  PEs  concurrently  applying  fetch-and- 
add,  with  an  increment  of  1,  to  a  shared  array  index.  Each  PE  obtains  an  index 
to  a  distinct  array  element  (although  one  cannot  say  which  element  will  be 
assigned  to  which  PE).  Furthermore,  the  shared  index  receives  the  appropriate 
total  increment. 

Section  3  presents  a  hardware  design  that  realizes  fetch-and-add  without  sig- 
nificantly increasing  the  time  required  to  access  shared  memory  and  that  realizes 
simultaneous  fetch-and-adds  updating  the  same  variable  in  a  particularly  efficient
manner. 

2.3.   The  Power  of  Fetch-and-add 

If  the  fetch-and-add  operation  is  available,  we  can  perform  many  important 
algorithms  in  a  completely  parallel  manner,  i.e.  without  using  any  critical  sec- 
tions. For  example,  as  indicated  above,  concurrent  executions  of  F&A(I,1)  yield 
consecutive  values  that  may  be  used  to  index  an  array.  If  this  array  is  interpreted 
as  a  (sequentially  stored)  queue,  the  values  returned  may  be  used  to  perform  con- 
current inserts;  analogously  F&A(D,1)  may  be  used  for  concurrent  deletes.  An 
implementation  may  be  found  in  Gottlieb,  Lubachevsky,  and  Rudolph  [83]2  who
also  indicate  how  such  techniques  can  be  used  to  implement  a  totally  decentral- 
ized operating  system  scheduler.  We  are  unaware  of  any  other  completely  paral- 
lel solutions  to  this  problem.  To  illustrate  the  nonserial  behavior  obtained,  we 
note  that  given  a  single  queue  that  is  neither  empty  nor  full,  the  concurrent  exe- 
cution of  thousands  of  inserts  and  thousands  of  deletes  can  all  be  accomplished  in 
the  time  required  for  just  one  such  operation.  Other  highly  parallel  fetch-and- 
add-based  algorithms  appear  in  Kalos  [81],  Kruskal  [81],  and  Rudolph  [82]. 
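
As a rough illustration of how the returned indices can drive a queue, the sketch below uses Python threads. It is not the algorithm of Gottlieb, Lubachevsky, and Rudolph: it omits all overflow and underflow checks, it assumes the queue is never full or empty, and it separates the insert phase from the delete phase so that no delete can read a slot before it is written. A lock stands in for the hardware primitive, so unlike the Ultracomputer this model does serialize the counter updates. The names `Counter`, `ArrayQueue`, and `run_concurrently` are hypothetical.

```python
import threading

class Counter:
    """Shared integer with fetch-and-add; the lock is a stand-in for the hardware primitive."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self._value
            self._value = old + e
            return old

class ArrayQueue:
    """Circular array queue; assumes callers never overflow or underflow it."""
    def __init__(self, size):
        self.slots = [None] * size
        self.insert_index = Counter()   # plays the role of I
        self.delete_index = Counter()   # plays the role of D

    def insert(self, item):
        i = self.insert_index.fetch_and_add(1)   # claim a distinct slot
        self.slots[i % len(self.slots)] = item

    def delete(self):
        d = self.delete_index.fetch_and_add(1)   # claim a distinct occupied slot
        return self.slots[d % len(self.slots)]

def run_concurrently(fn, args_list):
    """Start one thread per argument tuple and wait for all of them."""
    threads = [threading.Thread(target=fn, args=args) for args in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

queue = ArrayQueue(64)
run_concurrently(queue.insert, [(k,) for k in range(32)])
removed = []
run_concurrently(lambda: removed.append(queue.delete()), [()] * 32)
print(sorted(removed) == list(range(32)))
```

Every thread obtains a distinct index, so no two inserts write the same slot and no two deletes return the same item, although the assignment of slots to threads is not predictable.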


2  As  explained  in  Gottlieb  and  Kruskal  [81],  the  replace-add  primitive  defined  in  Gottlieb,  Lu- 
bachevsky, and  Rudolph  [83]  and  used  in  several  of  our  earlier  reports  is  essentially  equivalent  to 
fetch-and-add. 
3.  Machine  Design

In  this  section  we  sketch  the  design  of  the  NYU  Ultracomputer,  a  machine 
that  appears  to  the  user  as  a  paracomputer.  A  more  detailed  hardware  descrip- 
tion as  well  as  a  justification  of  various  design  decisions  and  a  performance 
analysis  of  the  communication  network  can  be  found  in  Gottlieb,  Grishman,  et  al. 
[83].  The  Ultracomputer  uses  a  message  switching  network  with  the  topology  of 
Lawrie's  [75]  Ω-network  to  connect  N  =  2^D  autonomous  PEs  to  a  central  shared
memory  composed  of  N  memory  modules  (MMs).  Thus,  the  direct  single  cycle 
access  to  shared  memory  characteristic  of  paracomputers  is  approximated  by  an 
indirect  access  via  a  multicycle  connection  network. 

3.1.  Network  Design 

For  machines  with  thousands  of  PEs  the  communication  network  is  likely  to 
be  the  dominant  component  with  respect  to  both  cost  and  performance.  The 
design  to  be  presented  achieves  the  following  objectives  and  we  are  unaware  of 
any  significantly  different  design  that  also  attains  these  goals. 

(1)  Bandwidth  linear  in  N,  the  number  of  PEs. 

(2)  Latency,  i.e.  memory  access  time,  logarithmic  in  N. 

(3)  Only  O(N  log  N)  identical  components.

(4)  Routing  decisions  local  to  each  switch;  thus  routing  is  not  a  serial  bottleneck 
and  is  efficient  for  short  messages. 

(5)  Concurrent  access  by  multiple  PEs  to  the  same  memory  cell  suffers  no  per- 
formance penalty;  thus  interprocessor  coordination  is  not  serialized. 
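
Objectives (2) and (3) can be made concrete with a small calculation. The sketch below assumes a multistage network built from k-by-k switches in which every stage contains N/k switches; the function name `network_size` and this layout are assumptions made for illustration, not details taken from the paper.

```python
def network_size(n, k=2):
    """Return (stages, switches) for an n-port multistage network of k-by-k switches.

    Assumes n is an exact power of k and that each stage holds n // k switches.
    """
    stages = 0
    ports = 1
    while ports < n:
        ports *= k
        stages += 1
    if ports != n:
        raise ValueError("n must be a power of k")
    return stages, (n // k) * stages

for n in (16, 256, 4096):
    print(n, network_size(n, 2), network_size(n, 4))
```

The stage count, which bounds the number of switches a request traverses, grows as log N, while the switch count grows as N log N.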

3.1.1.  Ω-network  Enhancements  The  manner  in  which  an  Ω-network  can
be  used  to  implement  memory  loads  and  stores  is  well  known  and  is  based  on  the
existence  of  a  (unique)  path  connecting  each  PE-MM  pair.  We  enhance  the  basic
Ω-network  design  as  follows:

(1)  The  network  is  pipelined,  i.e.  the  delay  between  messages  equals  the  switch 
cycle  time  not  the  network  transit  time.  (Since  the  latter  grows  logarithmi- 
cally, nonpipelined  networks  can  have  bandwidth  at  most  N/log  N.) 

(2)  The  network  is  message  switched,  i.e.  the  switch  settings  are  not  maintained 
while  a  reply  is  awaited.  (The  alternative,  circuit  switching,  is  incompatible 
with  pipelining.) 

(3)  Queues  are  associated  with  each  switch  to  enable  concurrent  processing  of 
requests  for  the  same  port.  (The  alternative  adopted  by  Burroughs  [79]  of 
killing  one  of  the  two  conflicting  requests  limits  bandwidth  to  O(N/log  N);
see  Kruskal  and  Snir  [82].) 
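
The unique PE-MM path, and the fact that each switch can route using only local information, can be sketched as follows. The code assumes the textbook form of the Ω-network: N = 2^D lines, 2-by-2 switches, a perfect shuffle in front of every stage, and a switch at stage i steered by bit i of the destination address (most significant bit first). The function name `omega_route` is hypothetical, and the actual Ultracomputer switches may differ in size and detail.

```python
def omega_route(source, dest, d):
    """Return the line number a message occupies after each of the d stages.

    Assumes 2-by-2 switches and a perfect shuffle before every stage, with
    the switch at stage i steered by bit i of dest, most significant first.
    """
    mask = (1 << d) - 1
    line = source
    path = []
    for stage in range(d):
        line = ((line << 1) | (line >> (d - 1))) & mask   # perfect shuffle: rotate left by one bit
        bit = (dest >> (d - 1 - stage)) & 1               # the only information this switch needs
        line = (line & ~1) | bit                          # leave by the upper (0) or lower (1) output
        path.append(line)
    return path

print(omega_route(source=5, dest=3, d=3))   # prints [2, 5, 3]
```

After D stages every bit of the source address has been shifted out and replaced by a destination bit, so the message arrives at `dest` whatever its origin, and the path is fixed by the (source, dest) pair.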

When  concurrent  loads  and  stores  are  directed  at  the  same  memory  location 
and  meet  at  a  switch,  they  can  be  combined  without  introducing  any  delay  (see 
Klappholz  [80]  and  Gottlieb,  Lubachevsky,  and  Rudolph  [83]).  Combining
requests  reduces  communication  traffic  and  thus  decreases  the  lengths  of  the 
queues  mentioned  above,  leading  to  lower  network  latency  (i.e.  reduced  memory 
access  time).  Since  combined  requests  can  themselves  be  combined,  the  network 
satisfies  the  key  property  that  any  number  of  concurrent  memory  references  to 
the  same  location  can  be  satisfied  in  the  time  required  for  one  central  memory 
access.  It  is  this  property,  when  extended  to  include  fetch-and-add  operations, 
that  permits  the  bottleneck-free  implementation  of  many  coordination  protocols. 

3.1.2.  Implementing  Fetch-and-add  By  including  adders  in  the  MMs,  the 
fetch-and-add  operation  can  be  easily  implemented:  When  F&A(X,e)  reaches  the 
MM  containing  X,  the  value  of  X  and  the  transmitted  e  are  brought  to  the  MM 
adder,  the  sum  is  stored  in  X,  and  the  old  value  of  X  is  returned  through  the  net- 
work to  the  requesting  PE.  Since  fetch-and-add  is  our  sole  synchronization  primi- 
tive (and  is  also  a  key  ingredient  in  many  algorithms),  concurrent  fetch-and-add 
operations  will  often  be  directed  at  the  same  location.  Thus,  as  indicated  above, 
it  is  crucial  in  a  design  supporting  large  numbers  of  processors  not  to  serialize  this 
activity. 

Enhanced  switches  permit  the  network  to  combine  fetch-and-adds  with  the 
same  efficiency  as  it  combines  loads  and  stores.  When  two  fetch-and-adds 
referencing  the  same  shared  variable,  say  F&A(X,e)  and  F&A(X,f),  meet  at  a 
switch,  the  switch  forms  the  sum  e+f,  transmits  the  combined  request 
F&A(X,e+f),  and  stores  the  value  e  in  its  local  memory.  When  the  value  Y  is
returned  to  the  switch  in  response  to  F&A(X,e+f),  the  switch  transmits  Y  to 
satisfy  the  original  request  F&A(X,e)  and  transmits  Y+e  to  satisfy  the  original 
request  F&A(X,f).  Assuming  that  the  combined  request  was  not  further  com- 
bined with  yet  another  request,  we  would  have  Y  =  X;  thus  the  values  returned 
by  the  switch  are  X  and  X+e,  thereby  effecting  the  serialization  order 
"F&A(X,e)  followed  immediately  by  F&A(X,f)".  The  memory  location  X  is  also 
properly  incremented,  becoming  X+e+f.  If  other  fetch-and-add  operations 
updating  X  are  encountered,  the  combined  requests  are  themselves  combined, 
and  the  associativity  of  addition  guarantees  that  the  procedure  gives  a  result  con- 
sistent with  the  serialization  principle. 
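
The combining rule just described can be exercised in a small software model. In the sketch below, requests to one address are combined pairwise, level by level, as if they met at successive switches; each "switch" forwards e+f and remembers e, and on the return trip hands back Y and Y+e. The function name `combined_fetch_and_add`, the dictionary used as memory, and the strict pairwise pattern are assumptions of this model; in the real network which requests meet depends on timing.

```python
def combined_fetch_and_add(memory, address, increments):
    """Serve simultaneous F&A requests on one address with a single memory access.

    Returns the values handed back to the requesters, in request order.
    """
    if not increments:
        return []
    if len(increments) == 1:
        old = memory[address]                  # the only access made at the memory module
        memory[address] = old + increments[0]
        return [old]
    # Forward trip: each switch sends e+f onward and stores e locally.
    saved = []
    forwarded = []
    for i in range(0, len(increments) - 1, 2):
        saved.append(increments[i])
        forwarded.append(increments[i] + increments[i + 1])
    leftover = increments[-1:] if len(increments) % 2 else []   # an unpaired request passes through
    replies = combined_fetch_and_add(memory, address, forwarded + leftover)
    # Return trip: a switch answers Y to the first request and Y+e to the second.
    results = []
    for y, e in zip(replies, saved):
        results.extend([y, y + e])
    if leftover:
        results.append(replies[-1])
    return results

memory = {"X": 100}
print(combined_fetch_and_add(memory, "X", [1, 2, 3, 4]), memory["X"])
```

For X = 100 and increments 1, 2, 3, 4 the requesters receive 100, 101, 103, and 106, which is what the serial order "first, second, third, fourth" would produce, and X becomes 110 after one access to memory.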

3.2.  Local  Memory  at  each  PE 

The  negative  impact  of  the  network  latency  can  be  partially  mitigated  by  pro- 
viding each  PE  with  a  local  memory  in  which  private  variables  reside  and  into 
which  read-only  shared  data  (in  particular,  program  text)  may  be  copied.  One 
common  design  for  parallel  machines  is  to  implement  a  separately  addressable 
local  memory  at  each  PE,  imposing  upon  compilers  and  loaders  the  onus  of 
managing  the  two  level  store. 

The  alternative  approach,  which  we  intend  to  adopt,  is  to  implement  the 
local  memory  as  a  cache.  Experience  with  uniprocessor  systems  shows  that  a 
large  cache  can  capture  up  to  95%  of  the  references  to  cacheable  variables  (see 
Kaplan  and  Winder  [73]).  Moreover,  a  cache  based  system  supports  dynamic
location  of  segments,  and  thus  permits  shared  read-write  segments  to  be  cached 
during  periods  of  exclusive  read-only  access  (see  Gottlieb,  Grishman,  et  al.  [83]).

3.3.  Machine  packaging 

We  conservatively  estimate  that  a  machine  built  in  1990  would  require  four 
chips  for  each  PE,  nine  chips  for  each  1  megabyte  MM  and  two  chips  for  each
4-input-4-output  switch.  Thus,  a  4096  processor  machine  would  require  roughly
65,000  chips,  not  counting  the  I/O  interfaces.  Note  that  the  chip  count  is  still
dominated,  as  in  present  day  machines,  by  the  memory  chips,  and  that  only  19%
of  the  chips  are  used  for  the  network.  Nevertheless,  most  of  the  machine  volume
will  be  occupied  by  the  network,  and  its  assembly  will  be  the  dominant  system
cost,  due  to  the  nonlocal  wiring  required.  Our  preliminary  estimate  is  that  the
PEs,  network,  and  MMs  would  occupy  a  5'x5'x10'  (air  cooled)  enclosure  (see
Bianchini  and  Bianchini  [82]). 
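
The chip estimate can be reproduced with a few lines of arithmetic. The sketch below takes the per-component chip counts from the text and a machine of 4096 PEs with one MM per PE. The switch count is an assumption: it supposes six stages (since 4^6 = 4096) of 1024 four-by-four switches each, which the text does not state explicitly but which yields the quoted totals.

```python
pes = 4096
chips_per_pe, chips_per_mm, chips_per_switch = 4, 9, 2   # figures quoted in the text

stages = 6                       # assumption: 4**6 == 4096, so six stages of 4-by-4 switches
switches = (pes // 4) * stages   # assumption: 1024 switches in every stage

pe_chips = pes * chips_per_pe
mm_chips = pes * chips_per_mm    # one MM per PE, as in section 3
net_chips = switches * chips_per_switch
total = pe_chips + mm_chips + net_chips

print(pe_chips, mm_chips, net_chips, total)
print(round(100 * net_chips / total, 2))
```

Under these assumptions the machine needs 16,384 PE chips, 36,864 memory chips, and 12,288 network chips, for a total of 65,536, of which the network accounts for 18.75%.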

4.  Other  Research 

In  this  section  we  show  how  our  work  relates  to  other  research  in  parallel 
processing. 

4.1.  Alternate  Machine  Models 

Gottlieb,  Grishman,  et  al.  [83]  discuss  systolic  processors,  vector  machines, 
dataflow  architectures,  and  message  passing  designs  and  explain  the  choice  of  an 
MIMD  shared  memory  machine.  We  summarize  their  remarks  as  follows. 

Systolic  processor  designs  (Kung  [80])  have  a  significant  and  growing  impact 
on  signal  processing  but  are  less  well  suited  for  computations  having  complex  con- 
trol and  data  flow.  We  expect  to  use  VLSI  systolic  systems  for  subcomponents  of 
the  Ultracomputer  having  regular  control  and  data  flow,  specifically  for  the  com- 
bining queues  found  in  the  network  switches. 

Current  vector  supercomputers  may  be  roughly  classified  as  SIMD  shared 
memory  machines  (cf.  Stone  [80])  that  achieve  their  full  power  only  on  algorithms
dominated  by  vector  operations.  However,  some  problems  (especially  those  with 
many  data  dependent  decisions,  for  example  particle  tracking)  appear  to  resist 
effective  vectorization  (Rodrigue  et  al.  [80]).  Our  simulation  studies  have  shown 
that  the  Ultracomputer  is  effective  for  particle  tracking  (Kalos  et  al.  [81])  as  well 
as  for  the  vectorizable  fluid-type  problems  (Korn  and  Rushfield  [83])  also  men- 
tioned by  Rodrigue  et  al. 

Dataflow  researchers,  joined  by  advocates  of  functional  programming,  have 
stressed  the  advantages  of  an  applicative  programming  language.  We  discuss  the
language  issue  below  and  note  that  Gottlieb  and  Schwartz  [82]  have  shown  how  a 
paracomputer  can  execute  a  dataflow  language  with  maximal  parallelism. 
We  subdivide  message  passing  architectures  based  on  whether  or  not  the
interconnection  topology  is  visible  to  the  programmer.  If  it  is  visible,  then  by 
tailoring  algorithms  to  the  topology,  very  high  performance  can  be  obtained  as  in 
the  Homogeneous  machine  (Seitz  [82])  and  the  original  Ultracomputer  design 
(Schwartz  [80]).  However,  we  found  such  a  machine  to  be  more  difficult  to  pro- 
gram than  one  in  which  the  entire  memory  is  available  to  each  PE.  If  the  topol- 
ogy is  hidden  by  having  the  routing  performed  automatically,  a  loosely  coupled 
design  emerges  that  is  well  suited  for  distributed  computing  but  is  less  effective 
when  the  PEs  are  to  cooperate  on  a  single  problem. 

4.2.  Languages 

Our  applications  have  been  programmed  in  essentially  trivial  extensions  of 
Pascal,  C,  and  FORTRAN  (primarily  the  latter).  One  should  not  conclude  that 
we  view  these  languages  as  near  optimal  or  even  satisfactory.  Indeed,  we  view 
the  progress  we  have  achieved  using  FORTRAN  as  a  worst  case  bound  on  the 
results  obtainable.  We  do,  however,  consider  it  a  strength  of  the  paracomputer 
design  that  it  imposes  few  requirements  on  the  language.  Even  simple  extensions 
of  old  serial  languages  permit  useful  work  to  be  accomplished.  Of  course  more 
useful  work  can  be  accomplished  when  better  language  vehicles  are  available  and 
we  expect  our  language  related  research  efforts  to  increase  as  we  study  a  broader 
range  of  applications. 

4.3.  Granularity  of  Parallelism 

The  Ultracomputer  emphasizes  a  relatively  coarse  grain  of  parallelism  in 
which  the  units  to  be  executed  concurrently  consist  of  several  high  level  language 
statements.  This  should  be  compared  with  the  fine  grained  dataflow  approach  in 
which  the  corresponding  units  are  individual  machine  operations  and  with  the 
extremely  coarse  grained  approach  taken  by  current  multiprocessor  operating  sys- 
tems in  which  the  units  are  entire  processes. 

4.4.  Processor  Count  and  Performance 

The  few  thousand  PEs  specified  for  the  Ultracomputer  represent  an  intermediate
value  for  this  parameter.  On  one  end  of  the  scale  we  find  architectures
that  support  only  a  few  dozen  PEs;  on  the  other  end  sit  designs  specifying  millions
of  PEs.  When  the  PE  count  is  modest,  as  in  the  S-1  (Livermore  Labs  [79]),
a  full  crossbar  interconnection  network  is  possible  as  is  the  use  of  high  speed,  high
power,  low  density  logic.  When  the  PE  count  is  massive,  as  in  NON-VON  (Shaw
[82]),  many  PEs  must  share  a  single  chip.  This  restricts  the  interconnection  pattern
chosen  due  to  the  I/O  limitation  inherent  in  VLSI.  Often  the  tree  network  is
chosen  since,  for  any  k,  2^k  PEs  can  be  packaged  together  with  only  4  external
lines.  Moreover,  there  is  not  likely  to  be  sufficient  chip  area  for  each  PE  to  store 
its  own  program  or  to  contain  instruction  decoding  logic.  Thus  an  SIMD  design 
seems  natural. 
When  the  PE  count  lies  between  these  extremes,  as  in  the  Ultracomputer,
high  density  logic  is  required  and  a  crossbar  is  not  feasible.  However,  an
Ω-network  (which  avoids  the  tree's  bottleneck  near  the  root)  is  possible,  as  is  an
MIMD  design.

4.5.  Networks

As  indicated  in  section  3,  the  Ω-network  permits  several  important  objectives
to  be  achieved  and  thus  appears  to  be  a  favorable  choice.  However,  more  general
Banyan  networks  (Goke  and  Lipovski  [73])  also  share  these  favorable  characteristics
and  remove  the  restriction  that  the  PEs  and  MMs  be  equal  in  number.  In
essence  we  have  (tentatively)  selected  the  simplest  Banyan  and  also  have  not  pursued
the  possibility  of  dynamically  reconfiguring  the  network.  This  last  possibility
has  been  studied  by  the  PASM  (Siegel  et  al.  [79]),  TRAC  (Sejnowski  et  al.
[80]),  and  Blue  Chip  (Snyder  [82])  projects.


4.6.  Compilers  and  Local  Memory 

By  specifying  a  cache  rather  than  a  (separately  addressed)  local  memory,  we
have  removed  from  the  compiler  the  task  of  managing  a  two  level  address  space.3 
However,  for  some  problems,  advanced  compiler  optimization  techniques  can 
better  the  results  obtained  by  caching.  More  significantly,  local  memory  can  be 
used  in  addition  to  caching.  Kuck  and  his  colleagues  at  Illinois  have  studied  the
compiler  issues  for  several  years  (Kuck  and  Padua  [79])  and  the  Cedar  project
(Gajski  et  al.  [83])  intends  to  use  this  expertise  to  effectively  manage  a  sophisticated
multilevel  memory  hierarchy.

5.   Conclusion 

Until  now  high  performance  machines  have  been  constructed  from  increas- 
ingly complex  hardware  structures  and  ever  more  exotic  technology.  It  is  our 
belief  that  the  NYU  Ultracomputer  approach  offers  a  simpler  alternative,  which  is 
better  suited  to  advanced  VLSI  technology:  High  performance  is  obtained  by 
assembling  large  quantities  of  identical  computing  components  in  a  particularly
effective  manner.  The  4096  PE  Ultracomputer  that  we  envision  has  roughly  the 
same  component  count  as  found  in  today's  large  machines.  The  number  of  dif- 
ferent component  types,  however,  is  much  smaller,  each  component  being  a 
sophisticated  one  chip  VLSI  system.  Such  machines  would  be  three  orders  of  magnitude
faster  and  would  have  a  main  storage  three  orders  of  magnitude  larger
than  present  day  machines. 

Our  instruction  level  simulations  indicate  that  the  NYU  Ultracomputer  would 
be  an  extremely  powerful  computing  engine  for  large  applications.  The  low  coor- 
dination overhead  and  the  large  memory  enable  us  to  use  efficiently  the  high 


3  Strictly  speaking  this  is  false  since  the  PE  registers  constitute  a  second  level.
degree  of  parallelism  available.  Finally,  our  programming  experience  using  the
simulator  indicates  that  the  manual  translation  of  serial  codes  into  parallel
Ultracomputer  codes  is  a  manageable  task.

We  have  not  emphasized  language  issues,  reconfigurability,  or  sophisticated 
memory  management.  Advances  in  these  areas  will  doubtless  lead  to  even 
greater  performance. 

To  demonstrate  further  the  feasibility  of  the  hardware  and  software  design
we  intend  to  construct  a  64  PE  prototype  that  will  use  commercial  microproces- 
sors and  memories  together  with  custom-built  VLSI  components  for  the  network. 